Step-by-step workflow on Zeisel dataset

Package loading

Before performing the analysis, several R-based packages should be installed on your local machine. Note that the packages may require an additional installation of other dependencies. Make sure that you have installed recommended version of the packages and your machine meets the mimimal system requirements for carrying out this analysis.

library(scater)
library(M3Drop)
library(monocle)
library(Seurat)
library(mclust)
library(scran)
library(SC3)

Data information

Zeisel2015 is a UMI-based scRNAseq dataset published in Zeisel et al. (2015) (Link to study: http://science.sciencemag.org/content/347/6226/1138) that contains the expression matrix of endogenous, spike-in and mitochondrial genes from 3005 cells from mouse cortex and hippocampus tissues. Additional metadata information are available, including cells origin, type or tissue. For the purpose of this tutorial we stored Zeisel2015 count matrix with metadata information in two separate txt files. We can read these files and create an easy-to-work SingleCellExperiment experiment object.

all.counts <- read.table(file ="data/all.counts.txt",  sep = "\t")
metadata <- read.table(file ="data/metadata.txt", sep = "\t", header = TRUE)

sce <- SingleCellExperiment(list(counts=as.matrix(all.counts)), colData=metadata)
sce
## class: SingleCellExperiment 
## dim: 19896 3005 
## metadata(0):
## assays(1): counts
## rownames(19896): Tspan12 Tshz1 ... ERCC-00170 ERCC-00171
## rowData names(0):
## colnames(3005): V3 V4 ... V3006 V3007
## colData names(10): tissue group ... level1class level2class
## reducedDimNames(0):
## spikeNames(0):

We can explore the information about cells and genes stored in SingleCellExperiment experiment object under colData and rawData slots, respectively.

counts(sce)[1:4,1:4]
##          V3 V4 V5 V6
## Tspan12   0  0  0  3
## Tshz1     3  1  0  2
## Fnbp1l    3  1  6  4
## Adamts15  0  0  0  0
names(colData(sce)) #available metadata information
##  [1] "tissue"      "group"       "mRNA"        "well"        "sex"        
##  [6] "age"         "diameter"    "cellId"      "level1class" "level2class"
table(colData(sce)$tissue) #tissue annotation
## 
## ca1hippocampus       sscortex 
##           1314           1691
table(colData(sce)$level1class) #cell type annotation
## 
## astrocytes_ependymal    endothelial-mural         interneurons 
##                  224                  235                  290 
##            microglia     oligodendrocytes        pyramidal CA1 
##                   98                  820                  939 
##         pyramidal SS 
##                  399
names(rowData(sce))
## character(0)

As mentioned, Zeisel2015 dataset contains also the expression of spike-ins and mitochondrial genes. Spike-in gene names usually starts from “ERCC” and mitochondrial genes contain “mt” string.

is.spike <- grepl("^ERCC", rownames(sce))
summary(is.spike)
##    Mode   FALSE    TRUE 
## logical   19839      57
is.mito <- grepl("^mt-", rownames(sce))
summary(is.mito)
##    Mode   FALSE    TRUE 
## logical   19862      34

Quality control and data preprocessing

We can use scater package to automatically calculate several quality metrics per cell and per gene. These metrics will be useful to i.e. detect low quality cells or lowly expressed genes that should be removed from further analysis. Remember to not define spike-ins in feature_controls if your dataset do not contain any ERCC genes.

sce <- calculateQCMetrics(sce,  feature_controls=list(Spike=is.spike, Mt=is.mito))

Quality metrics per cell

colnames(colData(sce))
##  [1] "tissue"                                        
##  [2] "group"                                         
##  [3] "mRNA"                                          
##  [4] "well"                                          
##  [5] "sex"                                           
##  [6] "age"                                           
##  [7] "diameter"                                      
##  [8] "cellId"                                        
##  [9] "level1class"                                   
## [10] "level2class"                                   
## [11] "is_cell_control"                               
## [12] "total_features_by_counts"                      
## [13] "log10_total_features_by_counts"                
## [14] "total_counts"                                  
## [15] "log10_total_counts"                            
## [16] "pct_counts_in_top_50_features"                 
## [17] "pct_counts_in_top_100_features"                
## [18] "pct_counts_in_top_200_features"                
## [19] "pct_counts_in_top_500_features"                
## [20] "total_features_by_counts_endogenous"           
## [21] "log10_total_features_by_counts_endogenous"     
## [22] "total_counts_endogenous"                       
## [23] "log10_total_counts_endogenous"                 
## [24] "pct_counts_endogenous"                         
## [25] "pct_counts_in_top_50_features_endogenous"      
## [26] "pct_counts_in_top_100_features_endogenous"     
## [27] "pct_counts_in_top_200_features_endogenous"     
## [28] "pct_counts_in_top_500_features_endogenous"     
## [29] "total_features_by_counts_feature_control"      
## [30] "log10_total_features_by_counts_feature_control"
## [31] "total_counts_feature_control"                  
## [32] "log10_total_counts_feature_control"            
## [33] "pct_counts_feature_control"                    
## [34] "pct_counts_in_top_50_features_feature_control" 
## [35] "pct_counts_in_top_100_features_feature_control"
## [36] "pct_counts_in_top_200_features_feature_control"
## [37] "pct_counts_in_top_500_features_feature_control"
## [38] "total_features_by_counts_Spike"                
## [39] "log10_total_features_by_counts_Spike"          
## [40] "total_counts_Spike"                            
## [41] "log10_total_counts_Spike"                      
## [42] "pct_counts_Spike"                              
## [43] "pct_counts_in_top_50_features_Spike"           
## [44] "pct_counts_in_top_100_features_Spike"          
## [45] "pct_counts_in_top_200_features_Spike"          
## [46] "pct_counts_in_top_500_features_Spike"          
## [47] "total_features_by_counts_Mt"                   
## [48] "log10_total_features_by_counts_Mt"             
## [49] "total_counts_Mt"                               
## [50] "log10_total_counts_Mt"                         
## [51] "pct_counts_Mt"                                 
## [52] "pct_counts_in_top_50_features_Mt"              
## [53] "pct_counts_in_top_100_features_Mt"             
## [54] "pct_counts_in_top_200_features_Mt"             
## [55] "pct_counts_in_top_500_features_Mt"
colData(sce)$total_counts[1:5] #Total count for each cell
## [1] 29060 29304 38929 40557 27625
colData(sce)$total_features_by_counts[1:5] #Number of genes with non zero counts per cell
## [1] 4898 4750 6090 5887 4753
if(length(is.spike==TRUE)>0){colData(sce)$pct_counts_Spike[1:5]} #Percentage of spike-in counts per cell
## [1] 23.07639 21.95946 16.27322 17.33856 21.46968

Quality metrics per gene

colnames(rowData(sce))
## [1] "is_feature_control"       "is_feature_control_Spike"
## [3] "is_feature_control_Mt"    "mean_counts"             
## [5] "log10_mean_counts"        "n_cells_by_counts"       
## [7] "pct_dropout_by_counts"    "total_counts"            
## [9] "log10_total_counts"
rowData(sce)$mean_counts[1:5] #Average count per gene
## [1] 0.27886855 0.42362729 1.04292845 0.04459235 0.37936772
rowData(sce)$n_cells_by_counts[1:5] #Number of cells with non zero counts per gene
## [1]  472  573 1167   77  669
rowData(sce)$pct_dropout_by_counts[1:5] #Percentage of dropouts per gene
## [1] 84.29285 80.93178 61.16473 97.43760 77.73710

Plot quality metrics

One can plot histograms based on quality metrics to determine possible thresholds for cell/gene filterings. We can also visualize some of the quality metrics on the PCA projected data.

assay(sce, "logcounts_raw") <- log2(counts(sce)+1) 

hist(sce$total_counts, breaks=50,  xlab="Library size", ylab="Number of cells")

plotPCA(sce, run_args=c(exprs_values="logcounts_raw"), size_by="total_counts")

hist(sce$total_features_by_counts, xlab="Number of expressed genes", breaks=50,  ylab="Number of cells")

plotPCA(sce, run_args=c(exprs_values="logcounts_raw"), size_by="total_features_by_counts")

if(length(is.spike==TRUE)>0)
{hist(sce$pct_counts_Spike, xlab="ERCC proportion (%)", ylab="Number of cells", breaks=50)
plotPCA(sce, run_args=c(exprs_values="logcounts_raw"), size_by="pct_counts_Spike")}

Automatic filtering of outlier cells based on quality metrics

Alternatively, isOutlier function provides an automatic detection of low quality cells based on quality metrics. In this example, we are filtering cells based on library size, number of expressed genes per cell and total count over all spike-in transcripts in each cell. Excluded cells are those with the total number of expressed genes and the total sum of counts more than 3 median absolute deviations below the median across the genes. Note that we dont filter many cells in this step, it is likely that dataset was already quality controlled by the original study.

libsize.drop <- isOutlier(sce$total_counts, nmads=3, type="lower", log=TRUE)
feature.drop <- isOutlier(sce$total_features_by_counts, nmads=3, type="lower", log=TRUE)
if(length(is.spike==TRUE)>0){spike.drop <- isOutlier(sce$pct_counts_Spike, nmads=3, type="higher")
ind <- libsize.drop | feature.drop | spike.drop}else{ind <- libsize.drop | feature.drop}

sce$removed=ind
plotPCA(sce, colour_by="removed", run_args=c(exprs_values="logcounts_raw"), shape_by="removed")

if(length(is.spike==TRUE)>0)
{sce <- sce[,!(libsize.drop | feature.drop | spike.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop), BySpike=sum(spike.drop), Remaining=ncol(sce))
}else{sce <- sce[,!(libsize.drop | feature.drop)]
data.frame(ByLibSize=sum(libsize.drop), ByFeature=sum(feature.drop), Remaining=ncol(sce))
}
##   ByLibSize ByFeature BySpike Remaining
## 1         8         3       8      2989

Explore highly expressed genes

We can also explore highly and overexpressed genes. The overexpressed genes, are those genes which expression is higher than others by several magnitudes. Overexpressed genes can bias further procedures such as clustering, therefore, they should be removed before performing downstream analysis. Note that highly expressed genes mostly include spike-ins and mitochondrial genes. One of the most commonly overexpressed endogenous gene is “Malat1”.

plotHighestExprs(sce, exprs_values = "counts")

Filter overexpressed features

In this step we will filter spike-in and mitochondrial genes as well as “Malat1”.

if(length(is.spike==TRUE)>0){sce=sce[-which(grepl("^ERCC-", rownames(sce))),]}
sce=sce[-which(grepl("^mt-", rownames(sce))),]
sce <- sce[-which(rownames(sce)=="Malat1"),]
dim(sce)
## [1] 19804  2989

Filter lowly expressed genes based on the average count per gene

Now, we should filter out lowly expressed genes as they do not provide any insights into the underlying biology. In the filtering we removed lowly expressed genes that are genes with average expression count (adjusted by library size) equal to 0. Note that we dont filter many genes in this step, it is likely that dataset was already quality controlled by the original study.

rowData(sce)$ave.counts <- calcAverage(sce, exprs_values = "counts", use_size_factors=FALSE)
to.keep <- rowData(sce)$ave.counts > 0
sce <- sce[to.keep,]
summary(to.keep)
##    Mode   FALSE    TRUE 
## logical       2   19802
dim(sce)
## [1] 19802  2989

Perform standard normalization by computing counts per million (CPM)

After quality control and filtering, one should normalize the dataset to remove potential technical bias. One strategy is to use CPM (Count Per Million) normalization which is a correction to remove the noise related to sequencing depth. CPM divides each count by its total sum (across all the genes) and multiplies by one million. In this way, each cell has the same total sum of the counts. We can plot PCA on the log-transformed CPM normalized counts and compare with previous PCA on the log raw counts.

cpm(sce) <- calculateCPM(sce, use_size_factors=FALSE)
cpm(sce)[1:4,1:4]
##                V3       V4      V5        V6
## Tspan12    0.0000  0.00000   0.000  94.75081
## Tshz1    144.5226 47.89501   0.000  63.16720
## Fnbp1l   144.5226 47.89501 197.336 126.33441
## Adamts15   0.0000  0.00000   0.000   0.00000
assay(sce, "logcounts") <- log2(cpm(sce)+1) 
plotPCA(sce,colour_by="level1class",run_args=c(exprs_values="logcounts"))

set.seed(100)
plotTSNE(sce, colour_by="level1class", run_args=c(exprs_values = "logcounts"))

plotUMAP(sce, colour_by="level1class", run_args=c(exprs_values = "logcounts"))

Perform normalization using decnvolution method

Another strategy, more suitable for scRNAseq data, is deconvolution method implemented in scran package. The deconvolution method normalizes data by cells-pooled size factors that account for dropout biases. More details about this normalization technique can be found in study https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0947-7. Similarly, we will plot PCA on the log scran normalized counts.

set.seed(100)
clusters <- quickCluster(sce, min.size=200, min.mean=0.1, method="igraph")
sce <- computeSumFactors(sce, cluster=clusters, min.mean=0.1)
sce <- scater::normalize(sce, return_log = FALSE)
normcounts(sce)[1:4,1:4]
##                V3        V4       V5       V6
## Tspan12  0.000000 0.0000000 0.000000 1.278561
## Tshz1    1.894734 0.6425685 0.000000 0.852374
## Fnbp1l   1.894734 0.6425685 2.582446 1.704748
## Adamts15 0.000000 0.0000000 0.000000 0.000000
assay(sce, "logcounts") <- log2(normcounts(sce)+1) 
plotPCA(sce,colour_by="level1class",run_args=c(exprs_values="logcounts"))

Reducing dataset dimensionality

To deal with a large number of dimensions in scRNAseq dataset a feature selection or dimension reduction strategies can be applied.

Feature selection - to retrieve most variable genes

Feature selection step is aimed at preserving biologically relevant information in the dataset and improving computational efficiency of downstream analyses. In most of the cases, it seeks for highly variable genes based on mean/variance relationship. M3Drop is one of the packages for automatic detection of most variable genes. It first calculates the mean and square coefficient of variation and fits the quadratic curve between the two variables. Then a chi-square test is used to find high variable genes which are significantly above the curve. We found CPM adjusted counts more appriopriate to use as input for this method rather than scran normalized values. Note that when using older versions of M3Drop package (<1.10.0) the output is an array of genes instead of gene table with p-values.

hvg_table <- BrenneckeGetVariableGenes(cpm(sce), fdr = 0.01, minBiolDisp = 0.5)

length(hvg_table$Gene)
## [1] 8183
sce_sub=sce[as.character(hvg_table$Gene),]

Dimension reduction - to project data into low-dimensional embedding

Dimension reduction can be used to visualize basic structure of the data or to reduce the number of features prior downstream analysis such as clustering. In contrast to the feature selection which extracts the most informative features, dimension reduction techniques projects the data into a new low-dimensional space that preserves the structure of the data. To reduce dataset dimensionality one can use previously mentioned PCA or more novel techniques such as tSNE or UMAP (all implemented in scater package). Note that projections should be applied to normalized and log-transformed count matrices. PCA is a deterministic approach while tSNE and UMAP are stochastic - for reproducibility of tSNE and UMAP projections one has to set the seed for generating random variables.

assay(sce_sub, "logcounts") <- log2(cpm(sce_sub)+1) 
plotPCA(sce_sub, colour_by="level1class", run_args=c(exprs_values = "logcounts"))  

set.seed(100)
plotTSNE(sce_sub, colour_by="level1class", run_args=c(exprs_values = "logcounts"))

plotUMAP(sce_sub, colour_by="level1class", run_args=c(exprs_values = "logcounts"))

Clustering

Clusterize cells using SC3 package

To cluster cells we can use SC3 package. SC3 is a consensus clustering method for single-cell RNA-seq data. It first calculates distances between the cells using three metrics: Euclidean, Pearson, and Spearman. On each of the obtained distance matrix two transformations are applied: PCA and Laplacian graph. Then K-Means clustering technique is used to cluster the transformed distance matrices subject to the first d eigenvectors. In result, several individual clusterings are obtained which are further combined into a single consensus clustering using Cluster-based Similarity Partitioning Algorithm (CSPA). In this example we will use available cell annotation in the clustering inference of SC3. In the real life applications we usually do not know the exact number populations in the sample. Note that clustering cells with SC3 may take a substantial amount of time. We will visualize results on the PCA projected space.

k_input=length(unique(sce$level1class))
rowData(sce)$feature_symbol <- rownames(sce)
sce <- sc3(sce, ks = k_input, gene_filter = FALSE, biology = FALSE)
table(colData(sce)[paste0("sc3_",k_input, "_clusters")][[1]])
## 
##   1   2   3   4   5   6   7 
## 960   8 708  49 517 311 436
assay(sce, "logcounts") <- log2(cpm(sce)+1)
plotTSNE(sce, colour_by=paste0("sc3_",k_input, "_clusters"), run_args=c(exprs_values = "logcounts"))

Evaluate clustering output based on annotation

If the cell annotation is available, one can measure the effectiveness of the clustering output using Adjusted Rand Index (ARI). ARI index measures similarity between the partition obtained from clustering and the partition obtained from dataset annotation. The values of the ARI range can be negative if the agreement of the partitions is worse then the agreement expected by chance, or between 0 and 1 for clustering better then chance. ARI close to 1 indicate high accuracy of the method in detecting annotated cell populations.

adjustedRandIndex(colData(sce)[paste0("sc3_",k_input, "_clusters")][[1]],sce$level1class)
## [1] 0.7940186

Exercise

How the clustering performance would change if you would estimate the number of cell populations in SC3 using sc3_estimate_k function? Is the number of estimated clusters close to the annotated number of cell populations ?

Additional plots

You can use colour_by argument to color points in any of the above projections by a selected marker gene or any information stored in the annotation. Additional violin plots allow you to i.e. visualize the expression of the selected gene across all the cells from a given cell type. For the illustrative purpose, we used Gad1 gene taken from the literature, which is a marker for interneuron cell type in the cortex tissue.

set.seed(100)
plotTSNE(sce, colour_by="Gad1", run_args=c(exprs_values = "logcounts")) 

plotExpression(sce, "Gad1", x = "level1class", colour_by = "level1class", exprs_values = "logcounts") 

Complete data analysis using Seurat package

Seurat is an R package designed for QC, analysis, and exploration of single-cell RNA-seq data. All procedures are based on Seurat object which can store the information about counts and dataset annotation. Note that when creating new Seurat object one can perform cell and gene filterings based on specified thresholds. Here we set both filterings to 0, as we will input already quality controlled and filtered counts.

Note that in Seurat versions <3.1.1 slot counts is relaced by raw.data, slot min.features is replaced by min.genes, and instead of seur@assays$RNA@counts user access the data by calling seur@data.

Create Seurat object

seur <- CreateSeuratObject(counts = counts(sce), meta.data = data.frame(colData(sce)), min.cells = 0, min.features = 0)
seur
## An object of class Seurat 
## 19802 features across 2989 samples within 1 assay 
## Active assay: RNA (19802 features)
seur@assays$RNA@counts[1:4,1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
##          V3 V4 V5 V6
## Tspan12   .  .  .  3
## Tshz1     3  1  .  2
## Fnbp1l    3  1  6  4
## Adamts15  .  .  .  .

Seurat also provides function to normalize data by a scaling factor followed by log-transformation.

seur <- NormalizeData(seur, normalization.method = "LogNormalize")
seur@assays$RNA@data[1:4,1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
##                 V3        V4       V5        V6
## Tspan12  .         .         .        0.6665506
## Tshz1    0.8941375 0.3913325 .        0.4896053
## Fnbp1l   0.8941375 0.3913325 1.089693 0.8168434
## Adamts15 .         .         .        .

Extract highly variable genes

For extracting most informative genes one can use FindVariableGenes function. When using this function one can change the selection method (selection.method). For the illustrative purpose, we will use vst method which fits a line to the relationship of log(variance) and log(mean) using local polynomial regression (loess). Then standardizes the gene expression values using the observed mean and expected variance (given by the fitted line). Gene variance is then calculated on the standardized values after clipping to a maximum (clip.max). For more details see help of FindVariableGenes function.

Note that in Seurat versions <3.1.1 function FindVariableFeatures is relaced by FindVariableGenes and there is no function VariableFeaturePlot.

seur <- FindVariableFeatures(object = seur, selection.method="vst", nfeatures = 2000)
top10 <- head(VariableFeatures(seur), 10)
plot1 <- VariableFeaturePlot(seur)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot2

Scale the data

For centering and scaling dataset you can use ScaleData function. Setting do.center=TRUE will center the expression for each gene by subtracting the average expression for that gene. Setting do.scale=TRUE will scale the expression for each gene by dividing the centered gene expression levels by their standard deviations (if do.center=TRUE), and by their root mean square, otherwise.

seur <- ScaleData(object = seur, do.center = TRUE, do.scale = TRUE)
seur@assays$RNA@scale.data[1:4,1:4]
##                       V3         V4         V5          V6
## Tshz1          1.6518806  0.4900755 -0.4141557  0.71714932
## 2310042E22Rik -0.1752873 -0.1752873  2.5351143  3.39875734
## Sema3c         2.7277730 -0.4150941  3.3787651  0.05362024
## Jam2           0.6895176 -0.3571584  0.3997203 -0.35715840

Reduce dataset dimension and clusterize cells

In this step, we will perform PCA dimension reduction, build nearest neighbor graph (on the Euclidean distances between any pair of cells in the PCA space) and clusterize cells using modularity optimization technique (Louvain algorithm) that seeks densily connected modules (cell clusters) in a graph structure. Note that FindClusters do not require from the user to input true number of cell populations. However the clustering performance strongly depends on the resolution parameter that controls the size of graph communities, hence the size of the clusters (for more details see the description of FindClusters function).

seur <- RunPCA(object = seur, pc.genes=seur@var.genes, ndims.print = 1:2, nfeatures.print = 5) #Reducing dimension with PCA
seur <- FindNeighbors(seur) #Computing nearest neighbor graph
seur <- FindClusters(object = seur, resolution = 0.1, algorithm = 1) #Clustering cells with modularity detection algorithm on graph
table(Idents(seur))
## 
##   0   1   2   3   4   5   6   7   8 
## 787 676 403 342 320 178 160  63  60

Compare clusterization with annotated cell types

We can use UMAP dimension reduction to compare the Seurat clustering and annotated cell populations.

seur <- RunUMAP(seur, dims = 1:10)
DimPlot(seur, group.by = "ident", reduction.use = "umap")

DimPlot(seur, group.by = "level1class", reduction.use = "umap")

Compute Adjusted Rand Index

Finally, we will verify the accuracy of the clustering using ARI index.

adjustedRandIndex(Idents(seur), seur@meta.data$level1class)
## [1] 0.6986632

Exercise

How the clustering would change if you would manipulate the resolution parameter?

Complete data analysis using Monocle package

Create CellDataSet object

Here we will use Monocle to perform a complete analysis on the example dataset. Monocle uses its own data object - CellDataSet, which can store the information about the experiment (expression matrix and metadata). Note that when creating new CellDataSet object the data should not be normalized - the method will normalize matrix internally when performing downstream analysis steps. It is suggested to set expressionFamily=negbinomial.size() when working with UMI or raw read counts to define the underlying distribution of the dataset. For more details see: http://cole-trapnell-lab.github.io/monocle-release/docs/#installing-monocle.

rowData(sce)$gene_short_name <- rownames(sce)
pd <- new("AnnotatedDataFrame", data = data.frame(colData(sce))) #cell info
fd <- new("AnnotatedDataFrame", data = data.frame(rowData(sce))) #gene info
cds <- newCellDataSet(cellData = counts(sce), phenoData = pd, featureData = fd, expressionFamily=VGAM::negbinomial.size())  
cds
## CellDataSet (storageMode: environment)
## assayData: 19802 features, 2989 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: V3 V4 ... V3007 (2989 total)
##   varLabels: tissue group ... Size_Factor (58 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: Tspan12 Tshz1 ... Gm21943 (19802 total)
##   fvarLabels: is_feature_control is_feature_control_Spike ...
##     gene_short_name (13 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:

Extract highly variable genes

In the first step, we will call estimateDispersions function to compute a smooth curve describing the relationship between the variance of the genes and the mean expression across all the cells. dispersionTable retrieves table of computed mean, empirical dispersion and fitted dispersion values. From this table we can subset top variable genes by thresholding on the mean and empirical dispersion. Black dots on the mean/dispersion figure marks the selected features.

cds <- BiocGenerics::estimateSizeFactors(cds)
cds <- estimateDispersions(cds)
disp <- dispersionTable(cds)
head(disp)
##    gene_id mean_expression dispersion_fit dispersion_empirical
## 1  Tspan12      0.27716042       5.633211            10.023476
## 2    Tshz1      0.41719267       3.973567             8.826994
## 3   Fnbp1l      0.83724181       2.325529             3.491744
## 4 Adamts15      0.04898606      28.664478            83.258508
## 5   Cldn12      0.30621583       5.164049             5.294964
## 6    Rxfp1      0.03631543      38.425363            12.890793
hvg <- subset(disp, mean_expression >= 0.5 & dispersion_empirical >= 0.1)
cds <- setOrderingFilter(cds, ordering_genes = hvg$gene_id)
plot_ordering_genes(cds)

cds <- cds[hvg$gene_id,]

Reduce dataset dimension and clusterize cells

Before performing clustering, we will reduce the feature selected matrix down to two dimensions using tSNE. Note that Monocle3 also provides UMAP as reduction_method.

cds <- reduceDimension(cds, max_components = 2, reduction_method = 'tSNE', verbose = T)

Unsupervised clustering in Monocle can be done using one of two techniques, densityPeak or louvain. densityPeak partitions cells based on the cell local density and their nearest distance. It aims to group cells with a higher local density and distance smaller than a fixed threshold. The number of clusters in this technique is automatically estimated through the inference. On the contrary, louvain approach is based on a community detection algorithm that seeks densely connected modules in a graph built on scRNAseq dataset. Although it is possible to fix the number of nearest neighbors when creating the graph, the number of clusters cannot be directly controlled. For the illustrative purpose, we will clusterize cells with densityPeak algorithm.

cds <- clusterCells(cds, num_clusters = NULL, method = "densityPeak")
## Distance cutoff calculated to 3.26211
table(cds$Cluster)
## 
##   1   2   3   4   5 
## 740 667 781 518 283

Compare clusterization with annotated cell types

We can plot now the projected data colored by the clustering output and the available cell annotation. Note that plot will depend on the reduction_method you used previously (either tSNE or UMAP projected data will be plotted).

p1 <- plot_cell_clusters(cds, color_by = 'Cluster')
p2 <- plot_cell_clusters(cds, color_by = 'level1class')
p1

p2

Compute Adjusted Rand Index

Finally, we will verify the accuracy of the clustering using ARI index.

adjustedRandIndex(cds$Cluster,cds$level1class)
## [1] 0.6961306

Exercise

Would you improve the performance of the method by setting method = “louvain” ? What is the difference between the number of estimated clusters in comparison to densityPeak approach?

Package versions

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
## LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] SC3_1.10.1                  scran_1.10.2               
##  [3] mclust_5.4.3                Seurat_3.1.1               
##  [5] monocle_2.99.2              L1Graph_0.1.1              
##  [7] lpSolveAPI_5.5.2.0-17       DDRTree_0.1.5              
##  [9] irlba_2.3.3                 igraph_1.2.4               
## [11] Matrix_1.2-17               M3Drop_1.10.0              
## [13] numDeriv_2016.8-1           scater_1.10.1              
## [15] ggplot2_3.1.0               SingleCellExperiment_1.4.1 
## [17] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
## [19] BiocParallel_1.16.6         matrixStats_0.54.0         
## [21] Biobase_2.42.0              GenomicRanges_1.34.0       
## [23] GenomeInfoDb_1.18.2         IRanges_2.16.0             
## [25] S4Vectors_0.20.1            BiocGenerics_0.28.0        
## 
## loaded via a namespace (and not attached):
##   [1] reticulate_1.11.1        R.utils_2.8.0           
##   [3] tidyselect_0.2.5         htmlwidgets_1.3         
##   [5] grid_3.5.1               combinat_0.0-8          
##   [7] docopt_0.6.1             Rtsne_0.15              
##   [9] munsell_0.5.0            codetools_0.2-16        
##  [11] ica_1.0-2                units_0.6-2             
##  [13] umap_0.2.0.0             statmod_1.4.30          
##  [15] future_1.14.0            miniUI_0.1.1.1          
##  [17] withr_2.1.2              colorspace_1.4-1        
##  [19] fastICA_1.2-1            knitr_1.22              
##  [21] rstudioapi_0.10          ROCR_1.0-7              
##  [23] robustbase_0.93-4        pbmcapply_1.3.1         
##  [25] gbRd_0.4-11              listenv_0.7.0           
##  [27] labeling_0.3             Rdpack_0.10-1           
##  [29] slam_0.1-45              bbmle_1.0.20            
##  [31] GenomeInfoDbData_1.2.0   pheatmap_1.0.12         
##  [33] rhdf5_2.26.2             coda_0.19-2             
##  [35] LearnBayes_2.15.1        xfun_0.5                
##  [37] R6_2.4.0                 doParallel_1.0.14       
##  [39] ggbeeswarm_0.6.0         rsvd_1.0.2              
##  [41] VGAM_1.1-1               locfit_1.5-9.1          
##  [43] manipulateWidget_0.10.0  bitops_1.0-6            
##  [45] assertthat_0.2.1         promises_1.0.1          
##  [47] SDMTools_1.1-221         scales_1.0.0            
##  [49] nnet_7.3-12              beeswarm_0.2.3          
##  [51] gtable_0.3.0             npsurv_0.4-0            
##  [53] Cairo_1.5-10             globals_0.12.4          
##  [55] rlang_0.4.0              splines_3.5.1           
##  [57] lazyeval_0.2.2           acepack_1.4.1           
##  [59] checkmate_1.9.1          rgl_0.100.19            
##  [61] yaml_2.2.0               reshape2_1.4.3          
##  [63] crosstalk_1.0.0          backports_1.1.3         
##  [65] httpuv_1.5.0             Hmisc_4.2-0             
##  [67] tools_3.5.1              spData_0.3.0            
##  [69] gplots_3.0.1.1           RColorBrewer_1.1-2      
##  [71] dynamicTreeCut_1.63-1    ggridges_0.5.1          
##  [73] Rcpp_1.0.1               plyr_1.8.4              
##  [75] base64enc_0.1-3          zlibbioc_1.28.0         
##  [77] classInt_0.3-1           purrr_0.3.2             
##  [79] RCurl_1.95-4.12          densityClust_0.3        
##  [81] rpart_4.1-13             deldir_0.1-16           
##  [83] reldist_1.6-6            pbapply_1.4-0           
##  [85] viridis_0.5.1            cowplot_0.9.4           
##  [87] zoo_1.8-5                ggrepel_0.8.0           
##  [89] cluster_2.0.7-1          magrittr_1.5            
##  [91] RSpectra_0.13-1          data.table_1.12.0       
##  [93] gmodels_2.18.1           lmtest_0.9-36           
##  [95] RANN_2.6.1               mvtnorm_1.0-10          
##  [97] fitdistrplus_1.0-14      lsei_1.2-0              
##  [99] mime_0.6                 evaluate_0.13           
## [101] xtable_1.8-3             sparsesvd_0.1-4         
## [103] gridExtra_2.3            HSMMSingleCell_1.2.0    
## [105] compiler_3.5.1           tibble_2.1.1            
## [107] KernSmooth_2.23-15       crayon_1.3.4            
## [109] R.oo_1.22.0              htmltools_0.3.6         
## [111] pcaPP_1.9-73             mgcv_1.8-28             
## [113] later_0.8.0              spdep_1.0-2             
## [115] Formula_1.2-3            rrcov_1.4-7             
## [117] tidyr_0.8.3              expm_0.999-4            
## [119] RcppParallel_4.4.2       DBI_1.0.0               
## [121] WriteXLS_4.1.0           MASS_7.3-51.3           
## [123] sf_0.7-3                 boot_1.3-20             
## [125] R.methodsS3_1.7.1        gdata_2.18.0            
## [127] metap_1.1                pkgconfig_2.0.2         
## [129] registry_0.5-1           foreign_0.8-71          
## [131] sp_1.3-1                 plotly_4.8.0            
## [133] foreach_1.4.4            vipor_0.4.5             
## [135] rngtools_1.3.1           pkgmaker_0.27           
## [137] webshot_0.5.1            XVector_0.22.0          
## [139] bibtex_0.4.2             doRNG_1.7.1             
## [141] stringr_1.4.0            digest_0.6.18           
## [143] tsne_0.1-3               sctransform_0.2.0       
## [145] RcppAnnoy_0.0.11         rmarkdown_1.12          
## [147] leiden_0.3.1             htmlTable_1.13.1        
## [149] edgeR_3.24.3             uwot_0.1.4              
## [151] DelayedMatrixStats_1.4.0 shiny_1.2.0             
## [153] gtools_3.8.1             nlme_3.1-137            
## [155] jsonlite_1.6             Rhdf5lib_1.4.3          
## [157] BiocNeighbors_1.0.0      viridisLite_0.3.0       
## [159] limma_3.38.3             pillar_1.4.2            
## [161] lattice_0.20-38          ggrastr_0.1.7           
## [163] DEoptimR_1.0-8           httr_1.4.0              
## [165] survival_2.44-1.1        glue_1.3.1              
## [167] qlcMatrix_0.9.7          FNN_1.1.3               
## [169] png_0.1-7                iterators_1.0.10        
## [171] glmnet_2.0-16            class_7.3-15            
## [173] stringi_1.4.3            HDF5Array_1.10.1        
## [175] latticeExtra_0.6-28      caTools_1.17.1.2        
## [177] dplyr_0.8.0.1            e1071_1.7-1             
## [179] future.apply_1.3.0       ape_5.3